Empirical evaluation of the link and content-based focused Treasure-Crawler

Authors

  • Ali Seyfi
  • Ahmed Patel
  • Joaquim Celestino
Abstract

Indexing the Web is becoming a laborious task for search engines as the Web grows exponentially in size and distribution. At present, the most effective known approach to this problem is the use of focused crawlers. A focused crawler applies an appropriate algorithm to detect the pages on the Web that relate to its topic of interest. For this purpose, we proposed a custom method that uses specific HTML elements of a page to predict the topical focus of all the pages that are targets of unvisited links in the current page. The pages recognized as on-topic must then be sorted by their relevance to the crawler's main topic before they are actually downloaded. In the Treasure-Crawler, we use a hierarchical structure called the T-Graph, which serves as an exemplary guide for assigning an appropriate priority score to each unvisited link. These URLs are later downloaded in order of this priority. This paper outlines the architectural design and presents the implementation, test results and performance evaluation of the Treasure-Crawler system. The Treasure-Crawler is evaluated in terms of information retrieval criteria such as recall and precision, both with values close to 0.5. Obtaining such an outcome supports the significance of the proposed approach.
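The two mechanisms the abstract mentions, downloading unvisited links in order of priority and evaluating the crawl by precision and recall, can be illustrated with a minimal Python sketch. This is not the Treasure-Crawler implementation, and it does not reproduce the T-Graph scoring: the priority score is assumed to be supplied by some external scoring step, and the names `Frontier` and `precision_recall` are hypothetical, introduced only for this illustration.

```python
import heapq
from itertools import count


class Frontier:
    """Queue of unvisited URLs; the highest-scored URL is popped first.

    Assumption: the score comes from an external relevance estimator.
    The T-Graph-based scoring described in the abstract is NOT modeled here.
    """

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._tie = count()  # insertion order breaks ties between equal scores

    def push(self, url, score):
        """Add a URL with its priority score; ignore URLs already queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the maximum first
        heapq.heappush(self._heap, (-score, next(self._tie), url))
        return True

    def pop(self):
        """Return (url, score) for the highest-priority unvisited URL."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

In a crawl loop built on this sketch, each link extracted from a downloaded page would be pushed with its estimated score, and the next page to fetch would be taken with `pop()`. Calling `precision_recall` on the set of downloaded pages and a set of known on-topic pages yields the two figures the abstract reports.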

Similar resources

A Focused Crawler Combinatory Link and Content Model Based on T-Graph Principles

The two significant tasks of a focused Web crawler are finding relevant topic-specific documents on the Web and analytically prioritizing them for later effective and reliable download. For the first task, we propose a sophisticated custom algorithm to fetch and analyze the most effective HTML structural elements of the page as well as the topical boundary and anchor text of each unvisited link...

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...

Improving the performance of focused web crawlers

This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers relying on web page content and link information for estimating the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also paths ...

Navigating the Small World Web by Textual Cues

Can a Web crawler efficiently locate an unknown relevant page? While this question is receiving much empirical attention due to its considerable commercial value in the search engine community, theoretical efforts to bound the performance of focused navigation have only exploited the link structure of the Web graph, neglecting other features. Here I investigate the connection between linkage an...

PDD Crawler: A focused web crawler using link and content analysis for relevance prediction

The majority of computer and mobile phone enthusiasts use the web for searching. Web search engines perform these searches; the results they return are supplied by a software module known as the Web crawler. The size of the web is increasing round the clock. The principal problem is to search this huge database for specific information. To state whethe...

Journal:
  • Computer Standards & Interfaces

Volume 44  Issue 

Pages  -

Publication date 2016